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PARALLEL PROCESSING METHOD OF AN EIGENVALUE PROBLEM FOR 
A SHARED -MEMORY TYPE SCALAR PARALLEL COMPUTER 

Cross-reference 

5 This application is a continuation-in-part 

application of U.S. patent application niimber 
10/289,648, filed on November 1, 2002, now abandoned. 

Background of the Invention 
10 Field of the Invention 

The present invention relates to matrix 
calculation in a shared-memory type scalar parallel 
computer. 

15 Description of the Related Art 

First, in order to solve the eigenvalue problem 
of a real symmetric matrix (matrix composed of real 
numbers, which does not changed even if the matrix 
elements are transposed) and an Hermitian matrix (matrix 

20 composed of complex numbers, which does not changed even 
if conjugated and transposed) (calculating X, in which 
detl A-A-I I = 0, and the eigenvector thereof if a matrix, 
a constant and a unit matrix are assumed to be A, X and 
I, respectively), tri-diagonalization (conversion into 

25 a matrix with a diagonal factor and adjacent factors 
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on both sides only) has been applied. Then, the 
eigenvalue problem of this tri-diagonal matrix is solved 
using a multi-section method. The eigenvalue is 
calculated and the eigenvector is calculated using an 
5 inverse repetition method. Then, Householder 
conversion is applied to the eigenvector, and the 
eigenvector of the original eigenvalue problem is 
calculated. 

In a vector parallel computer, an eigenvalue 

10 problem is calculated assuming that memory access is 
fast. However, in the case of a shared-memory type scalar 
parallel computer, the larger the matrix to be 
calculated, the greater the number of accesses to shared 
memory. Therefore, the performance of the computer is 

15 greatly decreased by accessing shared memory at low 
speed, which is a problem. Therefore, a matrix must be 
calculated effectively using a cache memory with fast 
access installed in each processor of a shared-memory 
type scalar parallel computer . Specifically, if amatrix. 

20 is calculated for each row or column, the number of 
accesses to shared memory increases . Therefore, amatrix 
must be divided into blocks and shared memory must be 
accessed after each processor processes data stored in 
a cache memory as much as possible. In this way, the 

25 ni:imber of accesses to shared memory can be reduced. In 



3 

this case, it becomes necessary for each processor to 
have a localized algorithm. 

In other words, since a shared-memory type 
parallel computer does not have fast memory access 
5 capability like a vector parallel computer, an algorithm 
must be designed to increase processing amount against 
accesses to shared memory. 

Sioinmary of the Invention 

10 It is an object of the present invention to provide 

a parallel processing method for calculating an 
eigenvalue problem at high speed in a shared-memory type 
scalar parallel computer. 

The parallel processing method of the present 

15 invention is a program enabling a computer to solve an 
eigenvalue problem on a shared-memory type scalar 
parallel computer. The method comprises dividing a real 
symmetric matrix or Hermitian matrix blocks, copying 
each divided block in the work area of memory and 

20 tri-diagonalizing the matrix using each product between 
the divided blocks; calculating an eigenvalue and an 
eigenvector based on the tri~diagonalized matrix; and 
converting the eigenvector by Householder conversion 
in order to transform the calculation into the parallel 

25 calculation of matrix calculations with a prescribed 



width of a block and calculating the eigenvector of the 
original matrix. 

According to the present invention, an eigenvalue 
problem can be solved with the calculation localized 
5 as much as possible in each processor of a shared-memory 
type scalar parallel computer. Therefore, delay due to 
frequent accesses to shared memory can be minimized, 
and the effect of parallel calculation can be maximized. 

10 Brief Description of the Drawings 

The present invention will be more apparent from 
the following detailed description in conjunction with 
the accompanying drawings, in which: 

Fig. 1 shows the hardware configuration of a 
15 shared-memory type scalar parallel computer assumed in 
the preferred embodiment of the present invention; 

Fig. 2 shows the algorithm of the preferred 
embodiment of the present invention (No. 1); 

Fig. 3 shows the algorithm of the preferred 
20 embodiment of the present invention (No. 2); 

Figs. 4A through 4F show the algorithm of the 
preferred embodiment of the present invention (No. 3) ; 

Fig. 5A through 5F show the algorithm of the 
preferred embodiment of the present invention (No. 4) ; 
25 Fig. 6 shows the algorithm of the preferred 
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embodiment of the present invention (No. 5) ; 

Fig. 7 shows the algorithm of the preferred 
embodiment of the present invention (No. 6) ; 

Fig. 8 shows the algorithm of the preferred 
5 embodiment of the present invention (No. 7) ; 

Fig. 9 shows the algorithm of the preferred 
embodiment of the present invention (No. 8); 

Fig. 10 shows the algorithm of the preferred 
embodiment of the present invention (No. 9) ; 
10 Fig. 11 shows the algorithm of the preferred 

embodiment of the present invention (No. 10); 

Fig. 12 shows the pseudo-code of a routine 
according to the preferred embodiment of the present 
invention (No. 1); 
15 Fig. 13 shows the pseudo-code of a routine 

according to the preferred embodiment of the present 
invention (No. 2); 

Fig. 14 shows the pseudo-code of a routine 
according to the preferred embodiment of the present 
20 invention (No. 3) ; 

Fig. 15 shows the pseudo-code of a routine 
according to the preferred embodiment of the present 
invention (No. 4); 

Fig. 16 shows the pseudo-code of a routine 
25 according to the preferred embodiment of the present 



invention (No. 5) ; 

Fig. 17 shows the pseudo-code of a routine 
according to the preferred embodiment of the present 
invention (No. 6); and 
5 Fig. 18 shows the pseudo-code of a routine 

according to the preferred embodiment of the present 
invention (No. 7) . 

Figs. 19 through 29 are flowcharts showing a 
pseudo-code process . 

10 

Description of the Preferred Embodiments 

In the preferred embodiment of the present 
invention, a blocked algorithm is adopted to solve the 
tri-diagonalization of the eigenvalue problem. The 

15 algorithm for calculating a divided block is recursively 
applied and the calculation density in the update is 
improved. Consecutive accesses to a matrix vector 
product can also be made possible utilizing symmetry 
in order to prevent a plurality of discontinuous pages 

20 of memory from being accessed. If data are read across 
a plurality of pages of cache memory, sometimes the data 
cannot be read at one time and the cache memory must 
be accessed twice. In this case, the performance of the 
computer degrades. Therefore, data is prevented from 

25 spanning a plurality of pages of cache memory. 



When applying Householder conversion to the 
eigenvector of a tri-diagonalized matrix and 
calculating the eigenvector of the original matrix, 
calculation density is improved by bundling every 80 
5 iterations of the Householder conversion and 
calculating three matrix elements. 

In the preferred embodiment of the present 
invention, conventional methods are used to calculate 
an eigenvalue based on a tri-diagonalized matrix and 
10 to calculate the eigenvector of the tri-diagonalized 
matrix. 

Fig. 1 shows the hardware configuration of a 
shared-memory type scalar parallel computer assumed in 
the preferred embodiment of the present invention. 

15 Each of processors 10-1 through 10-n has primary 

cache memory, and this primary cache memory is sometimes 
built into each processor. Each of the processors 10-1 
through 10-n is also provided with each of secondary 
cache memories 13-1 through 13-n, and each of the 

20 secondary cache memories 13-1 through 13-n is connected 
to an interconnection network 12. The interconnection 
network 12 is also provided with memory modules 11-1 
through 11-n, which are shared memories. Each of the 
processors 10-1 through 10-n reads necessary data from 

25 one of the memory modules, stores the data in one of 



the secondary cache memories 13-1 through 13-n or one 
of the primary cache memories through the 
interconnection network 12, and performs calculation. 

In this case, the speed of reading data from one 
5 of the memory module 11-1 through 11-n into one of the 
secondary cache memories 13-1 through 13-n or one of 
the primary cache memories and the speed of writing 
calculated data into one of the memory modules 11-1 
through 11-n from one of the primary cache memories is 

10 very low compared with the calculation speed of each 
of the processors 10-1 through 10-n. Therefore, the 
frequent occurrence of such reading or writing degrades 
the performance of the entire computer. 

Therefore, in order to keep the performance of the 

15 entire computer high, an algorithm that reduces the 
number of accesses to each of the memory modules 11-1 
through 11-n as much as possible and performs as much 
calculation as possible in a local system comprised of 
the secondary cache memories 13-1 through 13-n, primary 

20 cache memories and processors 10-1 through 10-n is 
needed. 

<Method for calculating an eigenvalue and an 
eigenvector> 

1 . Tri-diagonalization part 
25 1) Tri-diagonalization 
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a) Mathematical algorithm for divided 

tri-di agonal izat ion 

Amatrix is tri-diagonalized for each block width . 
Specifically, a matrix is divided into blocks and each 
5 divided block is tri-diagonalized using the following 
algorithm. 

Figs. 2 through 11 show the algorithm of the 
preferred embodiment of the present invention. 

Fig. 2 shows the process of the m-th divided block. 
10 In this case, a block is the rectangle with a column 
and a row, which are indicated by dotted lines, as each 
side shown in Fig. 2. 

For the process for a last block, the algorithm 
is applied to 2X2 matrix with block width 2 located in 
15 the left hand corner and then the entire process 
terminates . 

do i=l,blks 

stepl : Create a Householder vector u based on the (n+1 ) th 
row vector of An. 

20 step2: Calculate Vi=An+iU and Wi=Vi-u (u^v) /2 . 

step3 : Update as Ui= (Ui-i, Ui) and Wi= (Wi_i, wi) (In this case, 
(Ui_i,Ui) expands the matrix by one column by creating 
matrix Ui based on matrix Ui-i by adding one column) . 
step4 : if ( Kblks ) then 

25 Update the (n+i+l)th column of An. 
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An (*, n+i+1) =An (*,n+i+l) -UiWi (n+i+1) ^- 
WiUi (n+i+1, 
endif 

enddo 

5 steps : An+biks=An-UbiksWbiks^-WbiksUbiks^ 



• Tri-diagonalization by divided Householder conversion 
• -Explanation of Householder conversion 

V=(Vi,V2, .../Vn) 

10 I vNv*v=h^ 

If Uv= (h, 0, 0) there is the relationship of 

Uv=v- (Vi-h, V2, Vn) . 

U= (1-uu^/l u 1^) = ( 1- a uu^) , where u= (Vi-h, V2, Vn) . 

In the calculation below, a is neglected. 
15 An+i = U'^AnU = (l-uu"") Ad-uu'') 

= At, - UU^An - AnUU^ + UU^AnUU^ 

= An- uw^ - uu^u^v/2 - wu^u^v/2 + UU^U^V 

= An - UW^- WU^ (*) 

where w=v-u(u^v)/2 and v=AnU 
20 This is repeated, 

An+k = An - UkWk^ - WkUk"^ (**) 

As the calculation in the k-th step, Vn can be 
calculated according to equations (*) and (**) as 
follows . 

25 Vk = AnUk - Uk-iWk-i'^Uk - Wk-iUk-iV (***) 
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Wk=Vk-UkUk^Vk/2 

Uk= (Uk-i,Uk) , Wk= (Wk-i, Wk) 

b) Storage of information constituting Householder 
5 conversion 

The calculation of an eigenvector requires the 
Householder conversion, which has been used in the 

tri-diagonalization . For this reason, Un and a are 
stored in the position of a vector constituting the 
10 Householder conversion, a is stored in the position of 
a corresponding diagonal element. 

c) Method for efficiently calculating Ui 

In order to tri-diagonalize each block, the 
following vectors used for Householder conversion must 

15 be updated. In order to localize these calculations as 
much as possible, a submatrix of the given block width 
must be copied into a work area, is tri-diagonalized 
and is stored in the original area. Instead of updating 
a subsequent column vector for each calculation, 

20 calculation is performed in the form of a matrix product 
with improved calculation density. Therefore, the 
tri-diagonalization of each block is performed by a 
recursive program. 

recursive subroutine trid (width, block area 

25 pointer) 
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if (width<10) then 
c Tri-diagonalize the block with the width. 

Create vi and wi based on vector u needed for 
Householder conversion and a matrix vector product. 
5 Combine Ui and Wi with U and W, respectively. 

else 

c Divide a block width into halves. 

C Tri-diagonalize the former half block. 

call trid (width of the former half, area of the 
10 former half) 

c Divide a block and update the latter half divided by 

a division line. 

Update B=B-UW''-WU\ 

c Then, tri-diagonalize the latter half. 
15 call trid (width of the latter half/ area of the latter 

half) 

return 

end 

As shown in Fig. 3, a block is copied into a work 
20 area U and the block is tri-diagonalized by a recursive 
program. Since the program is recursive, the former half 
shown in Fig. 3 is tri-diagonalized when the recursive 
program is called for the update process of the former 
half. The latter half is updated by the former half and 
25 then is tri-diagonalized. 



As shown in Figs. 4A through 4F, when the recursive 
program is called to a depth of 2, the shaded portion 
shown in Fig. A is updated to B in the first former half 
process and then the shaded portion shown in Fig. 4C 
5 is updated and lastly the shaded portion shown in Fig. 
4F is updated. In parallel calculation at the time of 
update, the block matrix of the updated portion is evenly 
divided vertically into columns (divided in a row vector 
direction) , and the update of each portion is performed 

10 in parallel by a plurality of processors. 

The calculation of Fig. 4B is performed after the 
calculation of Fig. 4A, the calculation of Fig. 4D is 
performed after the calculation of Fig. C and the 
calculation of Fig. 4F is performed after the 

15 calculation of Fig. 4E. 

As shown in Fig. 5, when the shaded portion of U 
is updated, the horizontal line portion of u and the 
vertical line portion of W are referenced. In this way, 
calculation density can be improved. Specifically, Vn 

20 can be calculated according to the following equation 
(**) . 

An + k = An - UkWk^ - WkUk^ 

In this case, the reference pattern of U and W is 
determined according to the following equation (***) . 
25 Vk - AnUk - Uk-iWk-iV - Wk-iUk-iV (***) 



vjc is calculated for the tri-diagonalization of 
the updated portion after the update of U shown in Figs. 
4A and 4B, 4C and 4D, and 4E and 4F, U and W are referenced 
and Vk is calculated using a matrix vector product. Since 
5 this is just a reference, and the update and reference 
of U have a common part, U and W can be efficiently 
referenced. Instead of updating An each time, only a 
necessary portion is updated using U and W. Using 
equation the calculation speed of the entire 

10 update is improved, and performance is improved 
accordingly. Although equation (***) is extra 
calculation, it does not affect the performance of the 
entire calculation as long as the block width is kept 
narrow. 

15 For example, if four computers perform the 

parallel process, in the calculation of Wjc-i^Uk and Uk-i^Uk 
of equation , the shaded portion is divided in the 

direction of a vertical line (divided by horizontal 
lines), and parallel calculation is performed. As for 

20 the product of the results, the shaded portion is divided 
in the direction of a broken line, and parallel 
calculation is performed. 
Parallel calculation of Vi=AnUi 

As shown in Fig. 6, each processor divides the 

25 shaded portion in the second dimensional direction 
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utilizing the symmetry of An, that is, An=An^ and each 
processor calculates vi by An(*, ns:ne)^Ui. 
2) Parallel calculation in shared-memory type scalar 
parallel computer 
5 a) A storage area f or U and Wis allocated in shared memory. 
A block area to be tri-diagonalized is copied into a 
work area allocated separately and tri-diagonalization 
is applied to the area. 

The parallel calculation of the recursive program 
10 described above is as follows. 

(1) Necessary vectors are calculated according to the 
following equation of step 4 in order to calculate Ui 
needed to perform Householder conversion 

An (*,n+i + l) =An (*,n+i + l) -UiWi (n+i+1) 
15 WiUi (n+i+1, *) ^ 

(2) Vi is calculated in step 2 

This is calculated by making ui act on the 
following equation + ) . 

An+k = An - UkWk^ - WkUk^ 

20 In this calculation, the product of An and Ui, and 

the product of UkWk^-WkUk^ and Ui are processed in 
parallel . 

The block is copied in a work area and care must 
be paid so as not to update the necessary portion of 
25 An. The block is divided into matrices extended in a 
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column vector direction (divided into columns) 
utilizing the symmetry of An, and parallel calculation 
is performed. 

(3) In the recursive program, a block area is updated 
5 utilizing the following equation. 

An+k = An - UkWk^ - WkUk^ 

In this way, the amount of calculation of (1) is reduced. 
3) Update in step 5 

Utilizing symmetry during update, only the lower 

10 half of a diagonal element is calculated. In parallel 
calculation, if the number of CPUs is #CPU, in order 
to balance load, a sub-array, in which a partial matrix 
to be updated is stored, is evenly divided into 2x#CPU 
in the second dimensional direction and the CPUs are 

15 numbered from 1 to 2x#CPU. The i-th processor of each 
of 1 through #CPU updates in parallel the i-th and 

(2x#CPU+l-i) th divided sub-arrays. 

Then, calculated result is copied into the upper 
half. Similarly, this is also divided and the load is 

20 balanced. In this case, portions other than the diagonal 
block are divided into fairly small blocks so that data 
are not read across a plurality of pages of cache memory 
and are copied. The lower triangular matrix is updated 
by An+k=An-UkWk^-WkUk^ . In this case, the lower triangular 

25 matrix is divided into #CPUx2 of column blocks, two 
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outermost blocks, one at each end are sequentially 
paired. Each CPU updates such a pair. Fig. 7 shows a 
case where four CPUs are provided. 

After the lower triangular part is updated, the 
5 same pairs consisting of blocks 1 through 8 are 
transposed into an upper triangle portion and are copied 
into ul through u8 . 

In this case, the block is divided into small 
internal square blocks and is transposed using the cache. 
10 Then, the blocks are processed in parallel as during 
an update. 

Explanation on the improvement of the performance by 
transposition in the cache 

As shown in Fig. 8, square blocks are transposed 

15 and converted in ascending order of block numbers. The 
lower triangle of square area 1 is copied into the 
continuous area of memory, is transposed into rows by 
accessing in the direction of row and stored in the upper 
triangle of square area 1. Each square in the first 

2 0 column, namely squares 2 through 8, is copied and 
transposed into the corresponding square in the first 
row . 

2 . Calculation of eigenvectors 
a) Basic algorithm 
25 Vector Un is stored, then ( l-2*uu^/ (u^u) ) is 
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created and ( l-2*uu^/ (u^u) ) is multiplied by the vector . 

If tri-diagonalization is performed, the original 
eigenvalue problem can be transformed as follows. 

Qn-2....Q2QlAQl^Q2^...Q^-2^Qn-2....Q2QlX = A,Qn-2.-. • Q2Q1X 

5 Conversion is performed by calculating 

x=Qi^Q2^... •Qn-3^Qn-2^y based on the eigenvector y calculated 
by solving the tri-diagonalized eigenvalue problem, 
b) Block algorithm of the preferred embodiment of the 
present invention and parallel conversion calculation 
10 of eigenvectors 

When calculating many or all eigenvectors, the 
eigenvectors of tri-diagonal matrix are evenly assigned 
to each CPU, and each CPU performs the conversion 
described above in parallel. In this case, approximately 
15 80 conversion matrices are collectively converted. 

Each conversion matrix Qi^ can be expressed as 

l+aiUiUi^. The product of these matrices can be expressed 
as follows. 

n+k-l n +k- 1 
i=n j=l 

2 0 where 

bi,j: The collection of scalar coefficients other 
than UiUj at the leftmost and rightmost ends 

bi,j becomes an upper triangular matrix. Each 
conversion matrix Qi^ can be transformed into 1+UBU^. 
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Using this transformation, calculation density can be 
improved, and calculation speed can be improved 
accordingly. Fig. 9 shows a typical matrix B. 

Although the method described above has three 
5 steps, matrices to be processed become are U and B 
according to such memory access. Since B can be made 
fairly small, high efficiency can be obtained. After 
the (m-l)th bi,j is calculated, all bi,j is multiplied 

by (1+amUmUm^)/ and the following expression can be 
10 obtained. 

l+Z"i(ZbijU]) + «mU„U:„+U„2«n,u>i(ZMl) 

i j i j 

If i and j are swapped in the sum of the last term, the 
expression can be modified as follows. 

u„(2:(Z«mU>ib«)u;) 

15 The item located in the innermost parenthesis can 

be regarded as bm, j ( j=m+l , n+k) . In this case, bm,^ is 

A square work array W2 is prepared, and first, 

ociUiUj^ is stored in the upper triangle of w2(i,j). ai 
20 is stored in the diagonal element. 

The method described above can be calculated by 
sequentially adding one row on the top of each of the 

matrices upwards beginning with the 2x2 upper triangular 
matrix in the lower right corner. 



If each of the elements is calculated beginning 
with the rightmost row element, calculation can be 
performed in the same area since B is an upper triangular 
matrix and the updated portion is not referenced. In 
5 this way, a coefficient matrix located in the middle 
of three matrix products can be calculated using only 
very small areas. 

Fig. 10 shows a typical method for calculating the 
eigenvalue described above. 
10 Block width is assumed to be nbs. 

First, inner product ajUi-Uj is calculated and is 
stored in the upper half of B. 

ai is stored in the diagonal element. 
Then, calculation is performed as follows. 
15 do il=nbs-2, 1, -1 

do i2=nbs, il+1,-1 

sum=w2 (il, 12) 

do 13=12-1, il+1, -1 

sum=siim+w2 (11, 13) *w2 (13, 12) 
2 0 enddo 

w2 (il, 12) =sum 

enddo 

enddo 

do i2=nbs, 1,-1 
25 do 11=12-1,1,-1 
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w2 (il, i2)=w2 (il, 12) *w2 (12, 12) 

enddo 

enddo 

Fig. 11 shows a typical process of converting the 
5 eigenvector calculated above Into the eigenvector of 
the original matrix. 

The eigenvector Is converted by a Householder 
vector stored In array A. The converted vector is divided 
into blocks. The shaded portion shown in Fig. 11 is 
10 multiplied by the shaded portion of EV, and the result 
is stored in W. W2 is also created based on block matrix 
A. W2 and W are multiplied. Then, the block portion of 
A is multiplied by the product of W2 and W. Then, the 
shaded portion of EV is updated using the product of 
15 the block portion of A and the product of W2 and W. 
3 .Eigenvalue/eigenvector of Hermltian matrix 

An algorithm for calculating the 

eigenvalue/eigenvector of a Hermltian matrix replaces 
the transposition in the tri-diagonalization of a real 
20 symmetric matrix with transposition plus complex 

conjugation (t— >H) . A Householder vector is created by 
changing the magnitude of the vector in order to convert 
the vector into the scalar multiple of the original 
element . 

25 The calculated trl-dlagonal matrix is a Hermltian 
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matrix, and this matrix is scaled by a diagonal matrix 
with the absolute value of 1, 

A diagonal matrix is created as follows. 

di=l .0, di+i=hi-fi/l hi+i hdi 
5 Figs. 12 through 18 show the respective 

pseudo-code of routines according to the preferred 
embodiment of the present invention. 

Fig. 12 shows a subroutine for tri-diagonalizing 
a real symmetric matrix. 

10 Array a is stored in the lower triangle of a real 

symmetric matrix. The tri-diagonal matrix and 
sub-diagonal portion are stored in daig and sdiag, 
respectively. Information needed for conversion is 
stored in the lower triangle of a as output. 

15 U stores blocks to be tri-diagonalized. V is an 

area for storing W. 

nb is the number of blocks, and nbase indicates 
the start position of a block. 

After subroutine ^^copy" is executed, a block to 

20 be tri-diagonalized in u (nbase+1 : n, 1 : iblk) , routine 
blktrid is called and LU analysis is performed. Then, 
the processed u (nbase+1 : n, 1 : iblk) is written back into 
the original matrix a . In subsequent processes, the last 
remaining block is tri-diagonalized using subroutine 

25 blktrid. 
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Fig, 13 shows the pseudo-code of a 
tri-diagonalization subroutine. 

This subroutine is a routine for 
tri-diagonalizing block matrices and is recursively 
5 called, nbase is an offset indicating the position of 
a block, istart is the intra-block offset of a reduced 
sub-block to be recursively used, and indicates the 
position of the target sub-block. It is set to "1" when 
called for the first time, nwidth represents the width 

10 of a sub-block. 

If nwidth is less than 10, subroutine btunit is 
called. Otherwise, istart is stored in istart2, a half 
of nwidth is stored in nwidth2 . The sub-block is 
tri-diagonalized bysubroutine blktrid, and then 

15 Barrier synchronization is applied. 

Furthermore, the sum of istart and nwidth/2 is 
stored in istart3, and nwidth-nwidth/2 is stored in 
nwidth 3. Then, a value is set in is2, is3, ie2 and ie3, 
is and ie, each of which indicates the start or end 

20 position of a block, and len and iptr are also set. Then, 
after calculation is performed according to the 
expression shown in Fig. 13, the result is stored in 
u ( is : ie, is3 : ie3 ) , and Barrier synchronization is 
applied. Then, tri-diagonalization subroutine blktrid 

25 is called and the sub-block is processed. Then, the 
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subroutine process terminates. 

Fig. 14 shows the pseudo-code of the internal 
routine of a tri-diagonalization subroutine. 

In the internal tri-diagonalization subroutine 
5 btunit, after necessary information is stored, block 
start iptr2, width len, start position ^^is" and end 
position ie are determined, and Barrier synchronization 
is applied. Then, u (is : ie, i) t*u (is : ie, i) is stored in 
tmp, and Barrier synchronization is applied. Then, each 
10 value is calculated and is stored in a respective 
corresponding array. In this routine, sum and sqrt mean 
to sum and to calculate a square root. Lastly, Barrier 
synchronization is applied. 

Then, v(is:ie,i) is calculated, and Barrier 
15 synchronization is applied. Then, lens2, isx, iex, u 
and V are updated, and Barrier synchronization is 
applied. Furthermore, v(is:ie,i) is updated, and 
Barrier synchronization is applied. Furthermore, 
V ( is : ie, i) ^*u (is : ie, i) is calculated, tmp is stored and 
20 Barrier synchronization is applied. 

Then, a value is set in beta, and Barrier 
synchronization is applied. Then, v is updated by 
calculation using beta, and Barrier synchronization is 
applied. 

25 Then, if Kiblk and ptr2<n-2, u(is:ie,i+l) is 
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updated. Otherwise, u ( I : ie, i ; 1 : i+2) is updated using 
another expression and the process terminates. After 
the execution of this subroutine, the allocated threads 
are released. 

5 Fig. 15 shows the respective pseudo-code of a 

routine for updating the lower half of a matrix based 
on u and v, a routine for updating a diagonal matrix 
portion and a copy routine. 

In this code, nbase and nwidth are an offset 

10 indicating the position of a block and block width, 
respectively. 

In this subroutine update, after arrays a, u and 
V are allocated, Barrier synchronization is applied. 
Then, after blk, nbase2, len, isl, iel, nbase3, isr and 

15 ier are set, each of a ( iel : n, isl : iel ) and 
a (ier+1 :n, isr : ier) is updated. Then, a subroutine 
trupdate is called twice. Barrier synchronization is 
applied and the process is restored to the original 
routine . 

20 In subroutine copy, len, isl, lenl, nbase, isr and 

lenr are set, bandcp is executed twice and the process 
is restored to the original routine. 

Fig. 16 shows the pseudo-code of a routine copying 
an updated lower triangle in an upper triangle. 

25 In subroutine bandcp, nb, w, nn and loopx are set. 
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Then, in a loop do, TRL (a ( is2 : is2+nnx-l , is2 : is2+nnx) ) 
and TRL (w (I :nnx, 1 :nnx) ) ^ are stored in 

TRL (w (1 :nnx, 1 :nnx) ) and 
TRU (a (is2 : is2+nnx-l, is : is+nnx) ) , respectively. In 
5 this case, TRL and TRU represent a lower triangle and 
an upper triangle, respectively. 

Then, w ( 1 : nnx, 1 : nnx) and a ( is2 : is2+nnx, 
is3 : is3+nnx-l ) are updated. Then, w(l:ny, l:nx) and 
a ( is2 : is2+nnx, is3:n) are updated. 

10 Then, after the do loop has finished, the process 

is restored to the original routine. 

Fig. 17 shows the pseudo-code of a routine for 
converting the eigenvector of a tri-diagonal matrix into 
the eigenvector of the original matrix. 

15 In this case, the eigenvector of a tri-diagonal 

matrix is stored in ev ( 1 : n, 1 : nev) . a is the output of 
tri-diagonalization and stores information needed for 
conversion in a lower diagonal portion. 

Subroutine convev takes array arguments a and ev. 

20 Subroutine convev creates threads and performs a 

parallel process. 

Barrier synchronization is applied and len, is, 
ie and nevthrd are set. Then, routine convevthrd is 
called, and Barrier synchronized is applied after 

25 restoration and the process terminates. 
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Fig. 18 shows the pseudo-code of a routine for 
converting eigenvectors . 

In subroutine convevthrd, block width is stored 
in blk, and a, ev, w and w2 are taken as arrays. 
5 First, if width is less than 0, the original 

routine is restored without performing any process. In 
this case, nuitiblk and nfbs are set, and a value stored 
in a diagonal element at the time of tri-diagonalization 
with a code the reverse of the above (-a(i,i))is input 
10 in alpha, ev ( i+1 : n, 1 : iwidth) ^*a ( i+1 : n, i ) is input in 
x(l:iwidth), and ev is updated using 

ev(i+l :n, 1 : iwidth) ^*a (i+1 :n, i) , alpha and a. 
Furthermore, in a subsequent do sentence, is and ie are 
set, a (is+1 :n, is : ie) ^*ev (is+1 :n, 1 : iwidth) is replaced 
15 with a (is+1 :n, is : ie) ^*ev (is+1 :n, 1 : iwidth) and 

w (1 :blk, 1 : iwidth) is updated by TRL (a (ie+1 : is, 
is : ie) ) ^*ev (ie+1 : is, 1 : iwidth) . In this case, TRL is a 
lower triangular matrix. 

The diagonal element vector of a (is : ie, is : ie) is 
20 stored in the diagonal element vector DIAG(w2) of w2 . 

In a subsequent do sentence, w2 (il, 12) is updated 

by 

w2 (il, 12) * (a (is+12 :n, is+i2-l ) ^*a ( is+i2 : n, is; il-1) ) . 
Furthermore, in a subsequent do sentence, w2 (11,12) is 
25 updated by 
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w2 (il, 12) +w2 (il, il + l:i2-l) *w2 ( il + 1 : 12-1 , 12 ) . 

Furthermore, In a subsequent do sentence, 
w2(ll,12) Is updated by w2 (11, 12) *w2 (12, 12) • Then, 
w (1 : blk, 1 : Iwldth) , ev ( Is+n : n, 1 : Iwldth) and 

5 ev (le+1 : Is, 1 : Iwldth) are updated and the flow Is 
restored to the original routine. 

Figs. 19 through 29 are flowcharts showing a 
pseudo-code process . 

Fig. 19 Is a flowchart showing a subroutine trld 

10 for trl-dlagonallzlng a real symmetric matrix. In step 
SlO, shared arrays, A(k,n), dlag(n) and sdlag(n) are 
Inputted as subroutines, dlag and sdlag return the 
diagonal and sub-diagonal elements of a calculated 
trl-dlagonal matrix as output. Work areas U(n+l,lblk) 

15 and v (n+1 , Iblk) are reserved In the routine and are used 
In a shared attribute . In step Sll, threads are generated. 
In each thread, the total number of threads and a thread 
number assigned to each thread are set In local areas 
numthr and nothrd, respectively. Then, In each thread, 

20 the following Items are set. Block width Is set In Iblk, 
and nb= (n-2 + lblk-l ) /Iblk, nbase=0 and 1=1 are set. In 
step S12, It Is judged whether l>nb-l. If the judgment 
In step S12 Is positive, the flow proceeds to step S19. 
If the judgment In step S12 Is negative. In step S13, 

25 nbase= (1-1) xlblk, lstart=l and nwldth=lblk are set. In 
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step S14, a subroutine copy is called and the lower 
triangle is copied in the upper triangle. In step S15, 
a target area to which block tri-diagonalization is 
applied is copied in a work area U. Specifically, 
5 U (nbase+1 : n, 1 : iblk) <— A (nbase+1 :n, nbase+1 : nbase+iblk) 
is executed. In step S16, a subroutine blktrid is called 
and the area copied in U is tri-diagonalized (istart=l; 
the block width transfers iblk) . In step S17, the 
tri-diagonalized area is returned to an array A. 
10 Specifically, 

A(nbase+1 :n, nbase+1 :nbase+iblk) <— U (nbase+1 :n, 1 : iblk) 
is executed. In step S18, a subroutine update is called, 
and the lower triangle of 

A(nbase+liblk:n, nbase+iblk:n) is updated, and the flow 

15 returns to step S12. 

In step S19, nbase= (nb-1 ) xiblk, istart=l and 
iblk2=n-nbase are set. In step S20, the block- 
tri-diagonalization target area is copied in a work area 
U. Specifically, 

2 0 U (nbase+1 : n, 1 : nwidth) ^A (nbase+1 : n, nbase+1 : n) is 
executed. In step S21, a subroutine blktrid is called, 
and the copied area is tri-diagonalized (istart=l; the 
block width transfers iblk2) . In step S22, the 
tri-diaginalized area is returned to array A. 

25 Specifically, 
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A (nbase+1 : n, nbase+1 : n) <— U (nbase+1 : n, 1 : nwidth) is 
executed. In step 323, the threads generated for the 
parallel processing are deleted, and the subroutine 
terminates . 

5 Fig. 20 is a flowchart showing a subroutine 

blktrid. This subroutine is a recursive program. 

This subroutine is called by the following 
statement . 

Subroutine blktrid 
10 (A, k, n, dig, sdig, nbase, istart, nwidth, U, V, nothrd, 
numthrd) 

, where nbase is an offset indicating the position of 
a block, istart is an intra-block offset of a reduced 
sub-block to be recursively used and indicates the 

15 position of the target sub-block, which is set to ^^1" 
when called for the first time, and nwidth represents 
its block width . In step S25, it is judged whether nwidth 
<10. If the judgment in step S25 is negative, the flow 
proceeds to step 527. If the judgment in step S25 is 

20 positive, in step S26, a subroutine btunit is called, 
and tri-diagonalization is applied. Then, the 
subroutine terminates. In step S27, an update position 
and a block width which are used for recursive calling 
are changed, istart2=2istart and nwidth=nwidth/2 are 

25 set, and are transfered. The start position and width 
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of the reduced block are transfered. In step S28, a 
subroutine blktrid is recursively called. In step 329, 
barrier synchronization is applied between the threads. 
In step S30, a start position (is2,is3) and an end 
5 position (ie2,ie3), which are shared with each thread 
in update, are calculated. Specifically, 
istart3=istart+nwidth/2, nwidth3=nwidth-nwidth/2, 
is2=istart2 , ie2 = istart+nwidth2--l , is3=istart3 , 
ie3=istart3+nwidth3-l , iptr=nbase+istart3, 

10 len= (n-iptr+numthrd-1 ) /numthrd, 

is=iptr+ (nothrd-1) xlen+1 and 
ie=niin (n, iptr+nothrdxlen) are calculated. In step S31, 
U(is:ie,is3:ie3) =U (is: ie, is3:ie3) -U(is: ie, is2:ie2) xW 
(is3:ie3,is2:ie2) ^-W (is : ie, is2 : ie2) xU (is3 : ie3, is2 : ie 

15 2) ^ are calculated. InstepS32, barrier synchronization 
is applied between the threads . InstepS33, a subroutine 
blktrid is recursively called, and the subroutine 
terminates . 

Figs. 21 and 22 are flowcharts showing a 
20 subroutine btunit, which is an internal routine of 
subroutine blktrid. 

In step S35, tmp (numthrd) , sigma and alpha are 
assigned according to its shared attribute. In step S36, 
it is judged whether nbase+istart>n-2 . If the judgment 
25 in step S36 is positive, the subroutine terminates. If 
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the judgment in step S36 is negative, the flow proceeds 
to step S38. In this case, in step S38 i=istart is set. 

In step S39 it is judged whether i< istart-l+nwidth . If 
the judgment in step S39 is negative, the subroutine 
5 terminates. If the judgment in step S39 is positive, 
in step S40, start positions ""^is" and end positions ie 
which are shared with each thread are calculated. 
iptr2=nbase+i, len= (n-iptr2+numthrd-l ) /numthrd, 

is=iptr2+ (nothrd-1) xlen+1 and 

10 ie=min (n, iptr2+nothrdxlen) are calculated. In step 
S41, barrier synchronization is applied. In step S42, 
tmp (nothrd) =U (is : ie, i) U(is:ie,i) is calculated. In 
step S43, barrier synchronization is applied. In step 
S44, it is judged whether nothrd==l . If the judgment in 

15 step S44 is negative, the flow proceeds to step S46. 
If the judgment in step S44 is positive, in step S45, 
the square root of the sum of values partially calculated 
in each thread is calculated and is tri-diagonalized 
(generation of a householder vector) . 

2 0 sigma=sqrt (sum (tmp ( 1: numthrd) ) ) 

where ^^sum" and sqrt represent sum and square root, 
diag (iptr2) =u (iptr2, i) , sdiag (iptr2) =-sigma, 

U (nbase+i+1, i) =U (nbase+i+1, i) +sign (u (nbase+i+1, i) xsi 
gma, alpha=l . 0/ (sigmaxu (nbase+i+1, i) and 

25 U ( iptr2, i) =alpha are calculated, and the flow proceeds 



to step S46. In step S46, barrier synchronization is 
applied. In step S47, iptr3=iptr2 + l is calculated. In 
step S48, 
V ( is : ie, i ) =A ( iptr3 : n, iptr2+is : iptr2 + ie) (ptr3 : n, i ) 
5 is calculated. In step S49, barrier synchronization is 
applied. 

In step S50, 

V ( is : ie, i ) =alphaxV (is:ie, i)-V(is:ie, l:i-l)x(U( iptrS : 
n, 1 : i-1) ^xU (iptrS :n, i) ) -U (is : ie, 1 : i-1) x (V(iptr3 :n, 1 : 

10 i-l)^xU(iptr3:n, i) ) is calculated. In step S51, barrier 
synchronization is applied. In step S52, 
titip (nothrd) =V (is : ie, i) ^xU (is : ie, i) is calculated. In 
step S53, barrier synchronization is applied. In step 
S54, it is judged whether nothrd=l. If the judgment in 

15 step S54 is negative, the flow proceeds to step S56. 
If the judgment in step S54 is positive, the flow 
proceeds to step S55. In step S55, 

beta=0 . Sxalphaxsum ( 1 : numthrd) ) is calculated, where 
^'sum" is a symbol for summing vectors. In step S56, 
20 barrier synchronization is applied. In step S57, 

V (is : ie, i) =V (i : ie, i) -betaxU (is : ie, i) is calculated. 
In step S58, barrier synchronization is applied. In step 
S59, it is judged whether ptr2<n-2. If the judgment in 
step S59 is positive, in step S60, 
25 U (is; ie, i+1) =U (is : ie, i+1) -U (is : ie, istart : i) xV(i+l, is 
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tart : 1 ) ^-V (is : ie, is tart : i) xU (n+1 , is tart : 1 ) ^ is 
calculated, and the flow returns to step S39. If the 
judgment in step S59 is negative, in step S61, 
U ( is : ie, i+1 : i+2 ) =U ( is : ie, i+1 : i+2 ) -U ( is : ie, is tart : i ) x 
5 V ( i+1 : n, is tart : i ) ^-V (is : ie, i start : i) xU (n+1 : n, i start : 
i)^ is calculated and the subroutine terminates. 

Fig. 23 is a flowchart showing a subroutine 
update . 

In step S65, barrier synchronization is applied. 
10 In step S66, a pair is generated in each thread, and 
start and end positions, which are shared with each 
thread in update, are determined. Specifically, 

nbase2=nbase+iblk, 

len (n-nbase2+2xnumthrd-l ) / (2xnumthrd) , 
15 isl=nbase2+ (nothrd-1) xlen+1, 
iel=min (n, nbase2+nothrdxlen) , 

nbase3=nbase2+2xnumthrdxlen, isr=nbase3-nothrdxlen+l 
and ier=min (n, isr+len-1 ) are calculated. In step S67, 
A(iel+1 :n, isl : iel) =A(iel+l :n, isl+1 : n, isl : iel ) -W (iel 

2 0 +l:n, l:blk) xU (isl: iel, 1 : blk) ^-U ( iel ; 1 : n, 1 :blk) xW (isl 
: iel, 1 :blk) ^ and 
A(ier+l:n,isr:ier) =A(ier-i-l : n, isr : ier ) -W (ier+1 :n, 1 :b 
Ik) xU (isr : ier, 1 :blk) ^-U (ier+1 :n, 1 :blk) xW (isr : ier, 1 :b 
Ik)^ are calculated. In step S68, a subroutine trupdate 

25 is called, and a diagonal matrix in the left half is 
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updated, isl, iel. A, W and U are transfered. In step 
S69, subroutine trupdate is called and a diagonal matrix 
in the right half is updated, isr, ier. A, W and U are 
transfered. In step S70, barrier synchronization is 
5 applied/ and the subroutine terminates. 

Fig. 24 is a flowchart showing a subroutine 
trupdate (update of a diagonal matrix) . Update start 
position ^^is" and update end position ie are inputted, 
are used to update a rectangle located under the diagonal 

10 block before the subroutine is called. 

In step S75, block width for diagonal block update 
is set in blk2, and i=is is set. In step S76, it is judged 
whether i>ie-l. If the judgment in step S76 is positive, 
the subroutine terminates. If the judgment in step S7 6 

15 is negative, in step S77, update start and end positions 
in each thread are determined. Specifically, is2=i, 
ie2=min (i+blk2-l, ie-1 ) , 

A(is2 : ie-1, is2, ie2) =A(is2 : ie-1, is2, ie2) -U(is2 : ie-1, 
l:blk) xW(is2:ie2, 1 :blk) ^-W (is2 : iel-1, l:blk) xU(is2:ie 
20 2,l:blk)^ are calculated. In step 78, i=i+blk2 is set. 
The flow returns to step S76. 

Fig. 25 is a flowchart showing a subroutine copy. 
In step S80, a start position and width used to 
execute copying in parallel after making a pair in each 
25 thread, are calculated. Specifically, 
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len= (n-nbase+2xnumthrci-l ) / (2xnumthrd) / 
isl=nbase+ (nothrd-1) xlen+1, 

lenl=max (0,min (n- is 1+1, len9) and 

nbase3=nbase+2xnumthrdxlen, isr=nbase3-nothrdxlen+l 
5 and lenr=max (0,min (n-isr+1, len) ) are calculated. In 
step S81, a subroutine bandcp is called. An area, which 
is determined by a start position isl and width lenl 
on the left side of the pair, is copied. In step S82, 
subroutine bandcp is called, and an area, which is 
10 determined by a start position isr and width lenr on 
the right side of the pair, is copied. 

Fig. 2 6 is a flowchart showing a subroutine 
bandcp . 

This routine copies an area while transposing the 
15 the matrix on a cache, using a small work area WX. A 
start position and width are received in ^^is" and len, 
respectively, while work area is set as WX(nb,nb). 

In step S85, nn=min (nb, len) , loopx= (len+nn-1) /nn 
and j = l are calculated. In step S86, it is judged whether 
20 j>loopx. If the judgment in step S86 is positive, the 
subroutine terminates. If the judgment in step S86 is 
negative, in step S87, the size nnx and its offset ip 
of a diagonal block to be copied in WX are determined. 
Ip=is+ ( j-1) xnn, nl=len- ( j-1) xnn, nnx=min (nn, nl) , 
25 len2=n-ip-nnx+l, loopy= (len2+nn-l) /nn. 
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TRL (WX (1 :nnx, 1 :nnx) ) =TRL (A(ip: ip+nnx-1, ip: ip+nnx-1) 
)/ 

TRU (A(ip : ip+nnx-1, ip: ip+nnx-1) ) =TRL (WX ( 1 :nnx, i : nnx) 
), i=l,is2=ip and is3=ip+nnx are calculated, where TRU 
5 and TRL represent an upper triangle and a lower triangle, 
respectively. 

In step S88, it is judged whether i>loopy-l. If 
the judgment in step S88 is negative, in step S89, an 

area nnxnnx is transposed and copied. Specifically, 
10 WX ( 1 :nn, 1 :nnx) =A(is3 : is3+nn-l, is2 : is2+nnx-l) , 

A ( is2 : is2+nnx-l , is3 : is3+nn-l ) =WX ( 1 , nn : 1 , nnx) ^ and 
is3=is3+nn are calculated, and the flow returns to step 
S88. If the judgment in step S88 is positive, in step 
S90, the last part is copied. Specifically, nn=n-is3+l, 
15 WX ( 1 : nn, 1 : nx) =A ( is3 : n, is2 : is2+nnx-l) and 
A(is2 : is2+nnx-l, is3 :n) =WX (1 :nn, 1 :nx) are calculated 
and the flow returns to step S86. 

Fig. 27 is a flowchart showing a subroutine 
convev, 

20 In this routine, the number nev of eigenvectors 

to be calculated and a householder vector are stored 
in the lower half of ^'a". The eigenvectors of a 
tri-diagonal matrix are stored in ev(k,nev). 

In step S95, threads are generated. The total 

25 number of threads and their numbers (1 through numthrd) 
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are set in numthr and nothrd, respectively, of the local 
area of each thread. In step 396, barrier 
synchronization is applied. In step S79, start and end 
positions, which are shared with and calculated in each 
5 thread, are determined. Specifically, 

len= (nev+numthrd-l ) /nuitithrd, is= (nothrd-1) xlen+1, 

ie=min (nev, nothrdxlen) and width=ie-is+l are 
calculated. In step S98, a subroutine convevthrd is 
called, and the eigenvector of the tri-diagonal matrix 

10 is converted into that of the original matrix. An area 
where eigenvectors shared with each thread are stored 
and the number of eigenvectors ^'^width" are transfered. 
In step S99, barrier synchronization is applied. In step 
SlOO, the generated threads are deleted, and the 

15 subroutine terminates. 

Figs. 28 and 29 are flowcharts showing a 
subroutine convevthrd. 

This routine converts the eigenvectors of a 
tri-diagonal matrix, which are shared with each thread, 

20 into those of the original matrix. A vector and a 
coefficient that restore householder conversion are 
stored in array A. 

In step SI 10, a block width is set in blk. The block 
width is approximately 80. In step Sill, it is judged 

25 whether iwidth<0. If the judgement in the step Sill is 
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positive, the subroutine terminates. If the judgment 
in the step Sill is negative, the flow proceeds to step 
sll2. In step sll2, the first block to be converted in 
the following loop is obtained by sequentially 
5 calculating (1+auu^). Firstly, numblk= (n-2+blk-l ) /blk 
and nfbs=n-2-blkx (numblk-1) are calculated. In step 
S113, it is judged whether i<n-2-nfbs+l . If the judgment 
in step S113 is positive, the flow proceeds to step S114 . 
In step S114, alpha=-a (i, i) , x(l:iwidth) = 

10 a (i+1 :n, i) ^xev (i+1 :n, 1 : width) and 
ev (i+l : n, 1 : width) =ev (i+1 : n, 1 : width) +alphaxa ( i+1 : n, i) 
X (1 : iwidth) ^ are calculated, and the flow returns to step 
sll3. If the judgment in step S113 is negative, in step 
S115, i=l is set. In step S116, it is judged whether 

15 i>numblk-l. I the judgment in step S116 is negative, 
the subroutine terminates. If the judgment in step S116 
is positive, in step S117, U^xEV of (1+UBU^) in a block 
form is divided into an upper triangle matrix at the 
left end of and a rectangle on the right side, and 

20 they are separately calculated. Specifically, 
is=n-2- (nfns+lxblk) +1 and ie=ie+blk-l, 

W (1 :blk, iwidth) =a ( ie+1 : n, is : ie) ^x 

ev (is+1 : ie, 1 : iwidth) , W(l:blk-1, 1 : iwidth) =w (1 :blk-l, 

1 : iwidth) +TRL (a (is+1 : ie, is : ie-1) ) ^xev (is+1 : ie, 
25 Iriwidth) are calculated. Then, B of (1+UBU^) in a block 
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form is calculated, 

diag (w2) =-diag (a (is : is+blk-1, is :blk-l) ) and i2=blk: 

are calculated. A coefficient a 

corresponding to w2 is stored. In the above description, 
5 TRL(w2) and diag (x) represent the lower triangle matrix 
of w2 and the diagonal element of x, respectively. 

In step S118, it is judged whether i2<l. If the 
judgment in step S118 is negative, in step S119, the 

inner product of a householder vector xa is stored in 

10 the upper triangle of w2, and il=i2-l is set. In step 
sl20, it is judged whether il<l. If the judgment in step 
S120 is negative, in step S121, 

w2 (il, 12) =w2 (il, il) X (a (is+i2 :n, is + i2-l) ^xa (is + i2 :n, i 
s+il-1)) and il=il-l are calculated, and the flow 

15 returns to step S120. If the judgment in step S120 is 
positive, in step S122, i2=i2-l is set, and the flow 
returns to step S118. If the judgment in step S118 is 
positive in step S123, il=blk-2 is set, and then, an 
expansion coefficient is calculated in a double loop. 

20 The upper side of a triangle matrix is determined from 
right to left, and is calculated in such a way as to 
pile it up. This corresponds to determining a 
coefficient by adding expansion obtained by applying 
householder conversion from the left. In step sl24, it 

25 is judged whether il<l. If the judgment in step S124 



is negative, in step S125, i2=blk is set. In step S126, 
it is judged whether i2<il+l. If the judgment in step 
S126 is negative, in step S127, the elements of the upper 
side are determined from left to right. In this case, 
5 an immediately preceding coefficient is used. 
Specifically, 

w2 (i, 12) =w2 (il, 12) +w2 (11, il+l : 12-1) xw2 (11+1 : 12-1, 12) 
and 12 = 12-1 are calculated, and the flow returns to step 
S126. If the judgment in step S126 is positive, in step 
10 S128, 11-11-1 is set, and the flow returns to step S124. 
If the judgment in step S124 is positive, the flow 
proceeds to step S 129, and i2=blk is set. In step sl30, 
it is judged whether i2<l. If the judgment in step S130 

is negative, in step S131, coefficient a, which lacks, 
15 is multiplied in the following loop. Firstly, 11=12-1 
is set. In step S132,it is judged whether IKl. If the 
judgment in step S132 is negative, in step S133, 

w2 (11 . 12) =w2 (12, 12) xw2 (12, 12) and 11=11-1 are 
calculated, and the flow returns to step S132. If the 

20 judgment in step S132 is positive, in step S134, 12=12-1 
is set, and the flow returns to step S130. If the judgment 
in step S130 is positive, in step S135, BU^ is calculated 
and is stored in W. 

W ( 1 : blk, 1 : iwidth) =TRU ( w2 ) xW ( 1 : blk, 1 : iwidth) is 

25 calculated. Then, ( 1+UBU^) xEV is calculated using a 
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triangle located in the upper section of U, a rectangle 
located in the lower section of U and BU^ stored in W. 
Specifically, 

ev ( ie+1 : n, 1 : width) =ev (ie+1 : n, 1 : width) +a (ie+1 : n, is : i 
5 e) xW (1 :blk, 1 : width) , ev(is+l:ie, 1 : width) =ev (is+1 : ie, 
1 :width) +TRL (a (is+1 : ie, is+l:ie) ) xW(l:blk-l, 

Irwidth) is calculated, and the flow returns to step 
S115. 

According to the present invention, a 

10 high-performance and scalable eigenvalue/eigenvector 
parallel calculation method can be provided using a 
shared-memory type scalar parallel computer. 

According to the preferred embodiment of the 
present invention, in particular, the speed of 

15 eigenvector conversion calculation can be improved to 
be about ten times as fast as the conventional method. 
The eigenvalue/eigenvector of a real symmetric matrix 
calculated using these algorithms can also be calculated 
using Sturm's method and an inverse repetition method. 

20 The speed of calculation using seven CPUs is 6.7 times 
faster than the function of the numeric value 
calculation library of SUN called SUN performance 
library. The speed of the method of the present invention 
is also 2.3 times faster than a method for calculating 

25 the eigenvalue/eigenvector of a tri-diagonal matrix by 
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a ^Mivide & conquer" method, of another routine from 
SUN (in this case, it is inferior in function: 
eigenvalue/eigenvector cannot be selectively 
calculated) . 

5 The eigenvalue/eigenvector of a Hermitian matrix 

obtained using these algorithms can also be calculated 
using Sturm's method and an inverse repetition method. 
The speed of the method of the present invention using 
seven CPUs is 4.8 times faster than the function of the 

10 numeric value calculation library of SUN called the SUN 
performance library. The speed of the method of the 
present invention is also 3.8 times faster than a method 
for calculating the eigenvalue/eigenvector of a 
tri-diagonal matrix by a ^Mivide 6e conquer" method, of 

15 another routine of SUN (in this case, it is inferior 
in function: eigenvalue cannot be selectively 
calculated) . 

For basic algorithms of matrix computations, see 
the following textbook: 

20 G.H.Golub and C.F.Van Loan, ^'Matrix 

Computatrions" the third edition. The Johns Hopkins 
University Press (1996). 

For the parallel calculation of 

tri-diagonalization, see the following reference: 

25 J.Choi, J. J. Dongarra and D.W.Walker, The Design 



44 



of a Parallel Dense Linear Algebra Software Library: 
Reduction to Hessenberg, Traditional, and Bi-diagonal 
Form", Engineering Physics and Mathematics Division, 
Mathematical Sciences Section, prepared by the Oak Ridge 
National Laboratory managed by Martin Marietta Energy 
System, Inc., for the U. S . Department of Energy under 
Contract No. DE-AC05-840R21400, ORNL/TM-12472 . 

In this way, a high-performance and scalable 
eigenvalue/eigenvector calculation method can be 
realized. 



